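Significant wave height forecasting is a key problem in ocean data analytics. Predicting the significant wave height is crucial for estimating the energy produced by waves. Moreover, timely prediction of large waves is important for the safety of maritime operations, such as the passage of vessels. We frame the prediction of extreme values of significant wave height as an exceedance probability forecasting problem. Accordingly, we aim to estimate the probability that the significant wave height will exceed a predefined threshold. This task is usually solved with a probabilistic binary classification model. Instead, we propose a novel approach based on a forecasting model. The method uses the forecasts for upcoming observations to estimate the exceedance probability from the cumulative distribution function. We carried out experiments using data from a buoy off the coast of Halifax, Canada. The results suggest that the proposed method outperforms state-of-the-art approaches for exceedance probability forecasting.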
translated by 谷歌翻译
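The wave-height abstract above describes estimating an exceedance probability from forecasts through a cumulative distribution function. A minimal sketch of that idea follows. It assumes the forecasts can be treated as samples from a normal distribution, which the abstract does not state; the function name `exceedance_probability` and the sample values are hypothetical.

```python
from statistics import NormalDist, mean, stdev


def exceedance_probability(forecasts, threshold):
    """Estimate P(value > threshold) from a set of point forecasts.

    Assumption (not from the abstract): the forecasts are treated as
    samples from a normal distribution, so the exceedance probability
    is 1 - CDF(threshold).
    """
    mu = mean(forecasts)
    sigma = stdev(forecasts)  # needs at least two forecasts
    if sigma == 0:
        # Degenerate case: all forecasts agree, so the answer is 0 or 1.
        return 1.0 if mu > threshold else 0.0
    return 1.0 - NormalDist(mu, sigma).cdf(threshold)


# Hypothetical forecasts of significant wave height, in metres.
forecasts = [2.1, 2.4, 2.8, 3.0, 2.6]
p = exceedance_probability(forecasts, threshold=3.0)
```

The appeal of this route is that a single forecasting model can serve any threshold, whereas a binary classifier is tied to the threshold it was trained on.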
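A time series represents a set of observations collected over time. Typically, these observations are captured at a uniform sampling frequency (e.g., daily). When data points are observed at uneven time intervals, the time series is called irregular or intermittent. In such cases, the most common solution is to reconstruct the time series to make it regular, thereby removing its intermittency. We hypothesise that, in irregular time series, the time at which each observation is collected can help summarise the dynamics of the data and improve forecasting performance. We study this idea by developing a novel automatic feature engineering framework that focuses on extracting information from this perspective, i.e., when each instance is collected. We assess how valuable this information is by integrating it into a time series forecasting workflow, and we investigate how it compares to or complements state-of-the-art methods for regular time series forecasting. Finally, we contribute a novel framework that approaches feature engineering for time series from a previously overlooked angle. We show that our approach has the potential to extract further information about time series that significantly improves forecasting performance.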
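As an illustration of using collection times as a source of features, the sketch below summarises the gaps between consecutive observations. The abstract does not list the features its framework extracts, so the feature set, the function name `timing_features`, and the sample timestamps are all hypothetical.

```python
from datetime import datetime
from statistics import mean, pstdev


def timing_features(timestamps):
    """Summarise when observations were collected.

    Hypothetical feature set: statistics of the gaps (in seconds)
    between consecutive observation times.
    """
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    ordered = sorted(timestamps)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    return {
        "mean_gap_s": mean(gaps),
        "std_gap_s": pstdev(gaps),
        "max_gap_s": max(gaps),
        "last_gap_s": gaps[-1],
    }


# Three observations collected at uneven intervals (1 hour, then 3 hours).
stamps = [
    datetime(2022, 1, 1, 0, 0),
    datetime(2022, 1, 1, 1, 0),
    datetime(2022, 1, 1, 4, 0),
]
features = timing_features(stamps)
```

Features like these could be appended to the usual lagged values before fitting a forecasting model, instead of resampling the series onto a regular grid.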
In this work, a novel recommender system (RS) for tourism is presented. The RS is context-aware, as is now the rule in state-of-the-art recommender systems, and works on top of a tourism ontology that is used to group the items being offered. The presented RS mixes different types of recommenders into an ensemble that changes with the RS's maturity: it starts from simple content-based recommendations and iteratively adds popularity, demographic, and collaborative filtering methods as rating density and user cardinality increase. The result is an RS that mutates during its lifetime and uses a tourism ontology and natural language processing (NLP) to correctly bin the items into specific item categories and meta-categories in the ontology. This item classification facilitates the association between user preferences and items, and allows the items being offered to be better classified and grouped, which in turn is particularly useful for context-aware filtering.
Bi-encoders and cross-encoders are widely used in many state-of-the-art retrieval pipelines. In this work, we study the generalization ability of these two types of architectures across a wide range of parameter counts in both in-domain and out-of-domain scenarios. We find that the number of parameters and the early query-document interactions of cross-encoders play a significant role in the generalization ability of retrieval models. Our experiments show that increasing model size results in marginal gains on in-domain test sets, but much larger gains in new domains never seen during fine-tuning. Furthermore, we show that cross-encoders largely outperform bi-encoders of similar size in several tasks. In the BEIR benchmark, our largest cross-encoder surpasses a state-of-the-art bi-encoder by more than 4 average points. Finally, we show that using bi-encoders as first-stage retrievers provides no gains in comparison to a simpler retriever such as BM25 on out-of-domain tasks. The code is available at https://github.com/guilhermemr04/scaling-zero-shot-retrieval.git
There is an increasing need in our society to achieve faster advances in science to tackle urgent problems such as climate change, environmental hazards, sustainable energy systems, and pandemics, among others. In certain domains like chemistry, scientific discovery carries the extra burden of assessing the risks of proposed novel solutions before moving to the experimental stage. Despite several recent advances in machine learning and AI that address some of these challenges, there is still a gap in technologies to support end-to-end discovery applications, integrating the myriad of available technologies into a coherent, orchestrated, yet flexible discovery process. Such applications need to handle complex knowledge management at scale, enabling knowledge consumption and production in a timely and efficient way for subject matter experts (SMEs). Furthermore, the discovery of novel functional materials strongly relies on the development of exploration strategies in the chemical space. For instance, generative models have gained attention within the scientific community due to their ability to generate enormous volumes of novel molecules across material domains. These models exhibit extreme creativity that often translates into low viability of the generated candidates. In this work, we propose a workbench framework that aims to enable human-AI co-creation in order to reduce the time until the first discovery and the opportunity costs involved. This framework relies on a knowledge base with domain and process knowledge, and on user-interaction components to acquire knowledge and advise the SMEs. Currently, the framework supports four main activities: generative modeling, dataset triage, molecule adjudication, and risk assessment.
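Robust 2004 is an information retrieval benchmark whose large number of judgments per query makes it a reliable evaluation dataset. In this paper, we present mRobust04, a multilingual version of Robust04 translated into 8 languages using Google Translate. We also provide results of three different multilingual retrievers on this dataset. The dataset is available at https://huggingface.co/datasets/unicamp-dl/mrobust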
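Real-world images used for training machine learning algorithms are often unstructured and inconsistent. The process of analysing and labelling these images can be costly and error-prone (and is also subject to gaps and legal complications). However, as we demonstrate in this paper, the ability to generate accurate graphical images that are indistinguishable from real-world ones has many benefits in machine learning paradigms. One such example is football data from broadcast services (television and other streaming media sources). Football matches are usually recorded from multiple sources (cameras and phones) and at multiple resolutions, not to mention the occlusion of visual details and other artefacts (such as blurring, weathering, and lighting conditions) that make it difficult to identify features accurately. We demonstrate an approach that overcomes these limitations using generated, labelled, and structured images. The generated images can simulate a variety of views and conditions (including noise and blurring) that may occur only sporadically in the real world and that make it difficult for machine learning algorithms to "cope" with such unforeseen problems in real data. This approach enables us to rapidly train and prepare a robust solution that accurately extracts features (e.g., spatial locations, positions on the pitch, player positions, and camera FOV) from real-world football match sources for analytical purposes.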
Wearable sensor-based human activity recognition (HAR) has emerged as a principal research area and is utilized in a variety of applications. Recently, deep learning-based methods have achieved significant improvement in the HAR field with the development of human-computer interaction applications. However, they are limited to operating in a local neighborhood in the process of a standard convolutional neural network, and correlations between different sensors on body positions are ignored. In addition, they still face significant challenges with performance degradation due to large gaps between the distributions of training and test data, and to behavioral differences between subjects. In this work, we propose a novel Transformer-based Adversarial learning framework for human activity recognition using wearable sensors via Self-KnowledgE Distillation (TASKED), which accounts for individual sensor orientations and for spatial and temporal features. The proposed method is capable of learning cross-domain embedding feature representations from multi-subject datasets, using adversarial learning and maximum mean discrepancy (MMD) regularization to align the data distribution over multiple domains. In the proposed method, we adopt teacher-free self-knowledge distillation to improve the stability of the training procedure and the performance of human activity recognition. Experimental results show that TASKED not only outperforms state-of-the-art methods on the four real-world public HAR datasets (alone or combined) but also improves subject generalization effectively.
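For readers unfamiliar with the maximum mean discrepancy mentioned above, the sketch below computes a biased estimate of the squared MMD between two samples with an RBF kernel. It is a generic illustration, not the TASKED implementation: the kernel choice, the `gamma` value, and the sample vectors are assumptions.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """RBF kernel exp(-gamma * ||x - y||^2); the kernel choice is an assumption."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def mmd_squared(xs, ys, gamma=1.0):
    """Biased estimate of the squared MMD between two samples of vectors."""
    def mean_kernel(a, b):
        total = sum(rbf_kernel(p, q, gamma) for p in a for q in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Hypothetical embeddings from two subjects (domains).
source = [[0.0, 0.1], [0.2, -0.1], [-0.1, 0.0]]
target = [[2.0, 2.1], [1.9, 2.2], [2.1, 1.8]]
gap = mmd_squared(source, target)
```

Used as a regularizer, a term like this is added to the training loss so that embeddings from different subjects are pushed toward the same distribution; the value is zero when the two samples coincide.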
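Extracting knowledge from unlabeled text using machine learning algorithms can be complex. Document categorization and information retrieval are two applications that may benefit from unsupervised learning (e.g., text clustering and topic modeling), including exploratory data analysis. However, the unsupervised learning paradigm poses reproducibility issues. Initialization can lead to variability, depending on the machine learning algorithm. Furthermore, distortions of cluster geometry can be misleading. Among the causes, the presence of outliers and anomalies can be a determining factor. Although initialization and outlier issues are relevant to text clustering and topic modeling, the authors did not find an in-depth analysis of them. This survey provides a systematic literature review (2011-2022) of these subareas and proposes a common terminology, since similar procedures go by different names. The authors describe research opportunities, trends, and open issues. The appendices summarize the theoretical background of the text vectorization, factorization, and clustering algorithms that are directly or indirectly related to the reviewed works.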
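Modern 3D computer vision leverages learning to boost geometric reasoning, mapping image data to classical structures such as cost volumes or epipolar constraints to improve matching. These architectures are specialized for particular problems and thus require significant task-specific tuning, which often leads to poor domain generalization performance. Recently, generalist Transformer architectures have achieved impressive results in tasks such as optical flow and depth estimation by encoding geometric priors as inputs rather than as enforced constraints. In this paper, we extend this idea and propose to learn an implicit, multi-view consistent scene representation, introducing a series of 3D data augmentation techniques as a geometric inductive prior that increases view diversity. We also show that introducing view synthesis as an auxiliary task further improves depth estimation. Our Depth Field Networks (DeFiNe) achieve state-of-the-art results in stereo and video depth estimation without explicit geometric constraints, and improve zero-shot domain generalization by a wide margin.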